Modelling Classification Performance for Large Data Sets: An Empirical Study

Authors

  • Baohua Gu
  • Feifang Hu
  • Huan Liu
Abstract

For many learning algorithms, accuracy increases as the size of the training data increases, forming the well-known learning curve. A learning curve can usually be fitted by interpolating or extrapolating some of its points with a specified model. The fitted curve can then be used to predict the maximum achievable learning accuracy or to estimate the amount of data needed to reach a target accuracy, both of which are especially useful for data mining on large data sets. Although several models of learning curves have been proposed, most have not been tested for applicability to large data sets. In this paper, we focus on this issue. We empirically compare six potentially useful models by fitting learning curves of two typical classification algorithms, C4.5 (decision tree) and LOG (logistic discrimination), on eight large UCI benchmark data sets. By using all available data for learning, we fit a full-length learning curve; by using a small portion of the data, we fit a part-length learning curve. The models are then compared on two criteria: (1) how well they fit a full-length learning curve, and (2) how well a fitted part-length learning curve predicts learning accuracy at the full length. Experimental results show that the power law (y = a − b ∗ x^(−c)) is the best of the six models on both criteria, for both algorithms and all the data sets. These results support the applicability of learning curves to data mining.
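The part-length fitting and full-length prediction described in the abstract can be illustrated with a minimal sketch. It assumes the three-parameter power-law form y = a − b ∗ x^(−c), where x is the training-set size and y the accuracy. The fitting routine (a grid search over c with closed-form least squares for a and b), the function names, and the synthetic noiseless data are all hypothetical choices made for illustration; they are not the authors' procedure or results.

```python
def fit_power_law(sizes, accs, c_grid=None):
    """Fit y = a - b * x**(-c): grid search over c, least squares for a and b."""
    if c_grid is None:
        c_grid = [k / 100 for k in range(1, 301)]  # candidate exponents in (0, 3]
    n = len(sizes)
    y_mean = sum(accs) / n
    best = None
    for c in c_grid:
        # For a fixed c the model is linear in z = x**(-c): y = a - b * z.
        z = [x ** (-c) for x in sizes]
        z_mean = sum(z) / n
        szz = sum((zi - z_mean) ** 2 for zi in z)
        szy = sum((zi - z_mean) * (yi - y_mean) for zi, yi in zip(z, accs))
        slope = szy / szz            # the regression slope equals -b
        a = y_mean - slope * z_mean
        b = -slope
        sse = sum((a - b * zi - yi) ** 2 for zi, yi in zip(z, accs))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c


def predict(params, x):
    """Accuracy predicted by the fitted curve at training size x."""
    a, b, c = params
    return a - b * x ** (-c)


def required_size(params, target):
    """Training size needed to reach `target`, or None if it exceeds the asymptote a."""
    a, b, c = params
    if target >= a:
        return None
    return (b / (a - target)) ** (1 / c)


# Synthetic, noiseless learning curve (illustrative values only).
true_params = (0.90, 2.0, 0.5)
full_sizes = [100, 200, 400, 800, 1600, 3200, 6400, 12800, 25600, 51200]
full_accs = [predict(true_params, x) for x in full_sizes]

# Part-length fit: use only the first five points, then extrapolate.
part_params = fit_power_law(full_sizes[:5], full_accs[:5])
pred_at_full = predict(part_params, full_sizes[-1])
size_for_85 = required_size(part_params, 0.85)
```

In this form the fitted parameter a is the estimate of the maximum achievable accuracy, and `required_size` inverts the curve to estimate how much data a target accuracy would need. With real, noisy accuracy measurements a proper nonlinear least-squares solver would likely be a better choice than a fixed grid.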

Publication date: 2001